Importing the Datasets

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
hosp = pd.read_csv('Hospitalisation details.csv')
medi = pd.read_csv('Medical Examinations.csv')
name = pd.read_excel('Names.xlsx')
In [3]:
hosp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2343 entries, 0 to 2342
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Customer ID    2343 non-null   object 
 1   year           2343 non-null   object 
 2   month          2343 non-null   object 
 3   date           2343 non-null   int64  
 4   children       2343 non-null   int64  
 5   charges        2343 non-null   float64
 6   Hospital tier  2343 non-null   object 
 7   City tier      2343 non-null   object 
 8   State ID       2343 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 164.9+ KB
In [4]:
medi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2335 entries, 0 to 2334
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             2335 non-null   object 
 1   BMI                     2335 non-null   float64
 2   HBA1C                   2335 non-null   float64
 3   Heart Issues            2335 non-null   object 
 4   Any Transplants         2335 non-null   object 
 5   Cancer history          2335 non-null   object 
 6   NumberOfMajorSurgeries  2335 non-null   object 
 7   smoker                  2335 non-null   object 
dtypes: float64(2), object(6)
memory usage: 146.1+ KB
In [5]:
name.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2335 entries, 0 to 2334
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Customer ID  2335 non-null   object
 1   name         2335 non-null   object
dtypes: object(2)
memory usage: 36.6+ KB

1. Collate the files so that all the information is in one place

In [6]:
table1 = pd.merge(hosp,medi, on = "Customer ID")
In [7]:
dataset = pd.merge(table1,name, on = "Customer ID")
In [8]:
dataset
Out[8]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name
0 Id2335 1992 Jul 9 0 563.84 tier - 2 tier - 3 R1013 17.580 4.51 No No No 1 No German, Mr. Aaron K
1 Id2334 1992 Nov 30 0 570.62 tier - 2 tier - 1 R1013 17.600 4.39 No No No 1 No Rosendahl, Mr. Evan P
2 Id2333 1993 Jun 30 0 600.00 tier - 2 tier - 1 R1013 16.470 6.35 No No Yes 1 No Albano, Ms. Julie
3 Id2332 1992 Sep 13 0 604.54 tier - 3 tier - 3 R1013 17.700 6.28 No No No 1 No Riveros Gonzalez, Mr. Juan D. Sr.
4 Id2331 1998 Jul 27 0 637.26 tier - 3 tier - 3 R1013 22.340 5.57 No No No 1 No Brietzke, Mr. Jordan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2330 Id5 1989 Jun 19 0 55135.40 tier - 1 tier - 2 R1012 35.530 5.45 No No No No major surgery yes Kadala, Ms. Kristyn
2331 Id4 1991 Jun 6 1 58571.07 tier - 1 tier - 3 R1024 38.095 6.05 No No No No major surgery yes Osborne, Ms. Kelsey
2332 Id3 1970 ? 11 3 60021.40 tier - 1 tier - 1 R1012 34.485 11.87 yes No No 2 yes Lu, Mr. Phil
2333 Id2 1977 Jun 8 0 62592.87 tier - 2 tier - 3 R1013 30.360 5.77 No No No No major surgery yes Lehner, Mr. Matthew D
2334 Id1 1968 Oct 12 0 63770.43 tier - 1 tier - 3 R1013 47.410 7.47 No No No No major surgery yes Hawks, Ms. Kelly

2335 rows × 17 columns

2. Check for missing values in the dataset

In [9]:
dataset.isnull().sum()
Out[9]:
Customer ID               0
year                      0
month                     0
date                      0
children                  0
charges                   0
Hospital tier             0
City tier                 0
State ID                  0
BMI                       0
HBA1C                     0
Heart Issues              0
Any Transplants           0
Cancer history            0
NumberOfMajorSurgeries    0
smoker                    0
name                      0
dtype: int64
In [ ]:
 
In [ ]:
 

3. Find the percentage of rows that have trivial value (for example, ?), and delete such rows if they do not contain significant information

In [10]:
trivial_rows = dataset[dataset == "?"].count(axis=1).sum()
In [11]:
trivial_rows
Out[11]:
11
In [12]:
total_rows = dataset.shape[0]
total_rows
Out[12]:
2335
In [13]:
percentage = (trivial_rows / total_rows) * 100
In [14]:
percentage
Out[14]:
0.47109207708779444
In [15]:
print("Percentage of trivial rows: {:.2f}%".format(percentage))
Percentage of trivial rows: 0.47%
In [16]:
dataset = dataset[dataset != "?"].dropna()
In [17]:
dataset.shape
Out[17]:
(2325, 17)

4. Use the necessary transformation methods to deal with the nominal and ordinal categorical variables in the dataset

In [18]:
dataset_cat = dataset.select_dtypes(exclude='number')
In [19]:
dataset_cat.columns
Out[19]:
Index(['Customer ID', 'year', 'month', 'Hospital tier', 'City tier',
       'State ID', 'Heart Issues', 'Any Transplants', 'Cancer history',
       'NumberOfMajorSurgeries', 'smoker', 'name'],
      dtype='object')
In [20]:
dataset['Heart Issues'].value_counts()
Out[20]:
No     1405
yes     920
Name: Heart Issues, dtype: int64
In [21]:
dataset['Any Transplants'].value_counts()
Out[21]:
No     2183
yes     142
Name: Any Transplants, dtype: int64
In [22]:
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
In [23]:
dataset["Heart Issues"]=le.fit_transform(dataset["Heart Issues"])
dataset["Any Transplants"]=le.fit_transform(dataset["Any Transplants"])
dataset["Cancer history"]=le.fit_transform(dataset["Cancer history"])
dataset["smoker"]=le.fit_transform(dataset["smoker"])
In [24]:
dataset["Heart Issues"].value_counts()
Out[24]:
0    1405
1     920
Name: Heart Issues, dtype: int64
In [25]:
dataset['Hospital tier'].value_counts()
Out[25]:
tier - 2    1334
tier - 3     691
tier - 1     300
Name: Hospital tier, dtype: int64
In [26]:
dataset['Hospital tier'] = dataset['Hospital tier'].str.replace('tier', '')
#df['col1'] = df['col1'].str.replace('example-', '')
In [27]:
dataset['Hospital tier'].value_counts()
Out[27]:
 - 2    1334
 - 3     691
 - 1     300
Name: Hospital tier, dtype: int64
In [28]:
dataset['Hospital tier'] = dataset['Hospital tier'].str.replace("-", "")
dataset['Hospital tier'] = dataset['Hospital tier'].astype(int)
In [29]:
dataset
Out[29]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name
0 Id2335 1992 Jul 9 0 563.84 2 tier - 3 R1013 17.580 4.51 0 0 0 1 0 German, Mr. Aaron K
1 Id2334 1992 Nov 30 0 570.62 2 tier - 1 R1013 17.600 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P
2 Id2333 1993 Jun 30 0 600.00 2 tier - 1 R1013 16.470 6.35 0 0 1 1 0 Albano, Ms. Julie
3 Id2332 1992 Sep 13 0 604.54 3 tier - 3 R1013 17.700 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr.
4 Id2331 1998 Jul 27 0 637.26 3 tier - 3 R1013 22.340 5.57 0 0 0 1 0 Brietzke, Mr. Jordan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 Aug 4 0 52590.83 1 tier - 3 R1011 32.800 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B.
2330 Id5 1989 Jun 19 0 55135.40 1 tier - 2 R1012 35.530 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn
2331 Id4 1991 Jun 6 1 58571.07 1 tier - 3 R1024 38.095 6.05 0 0 0 No major surgery 1 Osborne, Ms. Kelsey
2333 Id2 1977 Jun 8 0 62592.87 2 tier - 3 R1013 30.360 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D
2334 Id1 1968 Oct 12 0 63770.43 1 tier - 3 R1013 47.410 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly

2325 rows × 17 columns

In [30]:
dataset['City tier'] = dataset['City tier'].str.replace("tier", "")
dataset['City tier'] = dataset['City tier'].str.replace("-", "")
In [31]:
dataset
Out[31]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name
0 Id2335 1992 Jul 9 0 563.84 2 3 R1013 17.580 4.51 0 0 0 1 0 German, Mr. Aaron K
1 Id2334 1992 Nov 30 0 570.62 2 1 R1013 17.600 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P
2 Id2333 1993 Jun 30 0 600.00 2 1 R1013 16.470 6.35 0 0 1 1 0 Albano, Ms. Julie
3 Id2332 1992 Sep 13 0 604.54 3 3 R1013 17.700 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr.
4 Id2331 1998 Jul 27 0 637.26 3 3 R1013 22.340 5.57 0 0 0 1 0 Brietzke, Mr. Jordan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 Aug 4 0 52590.83 1 3 R1011 32.800 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B.
2330 Id5 1989 Jun 19 0 55135.40 1 2 R1012 35.530 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn
2331 Id4 1991 Jun 6 1 58571.07 1 3 R1024 38.095 6.05 0 0 0 No major surgery 1 Osborne, Ms. Kelsey
2333 Id2 1977 Jun 8 0 62592.87 2 3 R1013 30.360 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D
2334 Id1 1968 Oct 12 0 63770.43 1 3 R1013 47.410 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly

2325 rows × 17 columns

In [32]:
dataset['City tier'] = dataset['City tier'].astype(int)
In [33]:
dataset
Out[33]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name
0 Id2335 1992 Jul 9 0 563.84 2 3 R1013 17.580 4.51 0 0 0 1 0 German, Mr. Aaron K
1 Id2334 1992 Nov 30 0 570.62 2 1 R1013 17.600 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P
2 Id2333 1993 Jun 30 0 600.00 2 1 R1013 16.470 6.35 0 0 1 1 0 Albano, Ms. Julie
3 Id2332 1992 Sep 13 0 604.54 3 3 R1013 17.700 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr.
4 Id2331 1998 Jul 27 0 637.26 3 3 R1013 22.340 5.57 0 0 0 1 0 Brietzke, Mr. Jordan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 Aug 4 0 52590.83 1 3 R1011 32.800 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B.
2330 Id5 1989 Jun 19 0 55135.40 1 2 R1012 35.530 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn
2331 Id4 1991 Jun 6 1 58571.07 1 3 R1024 38.095 6.05 0 0 0 No major surgery 1 Osborne, Ms. Kelsey
2333 Id2 1977 Jun 8 0 62592.87 2 3 R1013 30.360 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D
2334 Id1 1968 Oct 12 0 63770.43 1 3 R1013 47.410 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly

2325 rows × 17 columns

5. The dataset has State ID, which has around 16 states. All states are not represented in equal proportions in the data. Creating dummy variables for all regions may also result in too many insignificant predictors. Nevertheless, only R1011, R1012, and R1013 are worth investigating further. Create a suitable strategy to create dummy variables with these restraints.

In [34]:
dataset['State ID'].value_counts()
Out[34]:
R1013    609
R1011    574
R1012    572
R1024    159
R1026     84
R1021     70
R1016     64
R1025     40
R1023     38
R1017     36
R1019     26
R1022     14
R1014     13
R1015     11
R1018      9
R1020      6
Name: State ID, dtype: int64
In [35]:
dataset['state_group'] = np.where((dataset['State ID'] == 'R1011') | (dataset['State ID'] == 'R1012') | (dataset['State ID'] == 'R1013'), dataset['State ID'], 'other')
In [36]:
state_dummies = pd.get_dummies(dataset['state_group'], prefix='state')
df = pd.concat([dataset, state_dummies], axis=1)
In [37]:
dataset
Out[37]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name state_group
0 Id2335 1992 Jul 9 0 563.84 2 3 R1013 17.580 4.51 0 0 0 1 0 German, Mr. Aaron K R1013
1 Id2334 1992 Nov 30 0 570.62 2 1 R1013 17.600 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P R1013
2 Id2333 1993 Jun 30 0 600.00 2 1 R1013 16.470 6.35 0 0 1 1 0 Albano, Ms. Julie R1013
3 Id2332 1992 Sep 13 0 604.54 3 3 R1013 17.700 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. R1013
4 Id2331 1998 Jul 27 0 637.26 3 3 R1013 22.340 5.57 0 0 0 1 0 Brietzke, Mr. Jordan R1013
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2329 Id6 1962 Aug 4 0 52590.83 1 3 R1011 32.800 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B. R1011
2330 Id5 1989 Jun 19 0 55135.40 1 2 R1012 35.530 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn R1012
2331 Id4 1991 Jun 6 1 58571.07 1 3 R1024 38.095 6.05 0 0 0 No major surgery 1 Osborne, Ms. Kelsey other
2333 Id2 1977 Jun 8 0 62592.87 2 3 R1013 30.360 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D R1013
2334 Id1 1968 Oct 12 0 63770.43 1 3 R1013 47.410 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly R1013

2325 rows × 18 columns

In [38]:
dataset=dataset[dataset["State ID"].isin(['R1011','R1012','R1013'])]
dataset.shape
dataset["State ID"]=le.fit_transform(dataset["State ID"])
dataset["State ID"].unique()
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
Out[38]:
array([2, 1, 0])
In [39]:
dataset['state_group'].value_counts()
Out[39]:
R1013    609
R1011    574
R1012    572
Name: state_group, dtype: int64
In [ ]:
 
In [40]:
dataset["state_group"].replace('R1011',1,inplace=True)
dataset["state_group"].replace('R1012',2,inplace=True)
dataset["state_group"].replace('R1013',3,inplace=True)
dataset["state_group"].replace('other',0,inplace=True)
/usr/local/lib/python3.7/site-packages/pandas/core/series.py:4582: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
In [41]:
dataset
Out[41]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name state_group
0 Id2335 1992 Jul 9 0 563.84 2 3 2 17.58 4.51 0 0 0 1 0 German, Mr. Aaron K 3
1 Id2334 1992 Nov 30 0 570.62 2 1 2 17.60 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P 3
2 Id2333 1993 Jun 30 0 600.00 2 1 2 16.47 6.35 0 0 1 1 0 Albano, Ms. Julie 3
3 Id2332 1992 Sep 13 0 604.54 3 3 2 17.70 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. 3
4 Id2331 1998 Jul 27 0 637.26 3 3 2 22.34 5.57 0 0 0 1 0 Brietzke, Mr. Jordan 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2328 Id7 1994 Oct 27 1 51194.56 1 3 0 36.40 6.07 0 0 0 No major surgery 1 Macpherson, Mr. Scott 1
2329 Id6 1962 Aug 4 0 52590.83 1 3 0 32.80 6.59 0 0 0 No major surgery 1 Baker, Mr. Russell B. 1
2330 Id5 1989 Jun 19 0 55135.40 1 2 1 35.53 5.45 0 0 0 No major surgery 1 Kadala, Ms. Kristyn 2
2333 Id2 1977 Jun 8 0 62592.87 2 3 2 30.36 5.77 0 0 0 No major surgery 1 Lehner, Mr. Matthew D 3
2334 Id1 1968 Oct 12 0 63770.43 1 3 2 47.41 7.47 0 0 0 No major surgery 1 Hawks, Ms. Kelly 3

1755 rows × 18 columns

6. The variable NumberOfMajorSurgeries also appears to have string values. Apply a suitable method to clean up this variable.

In [42]:
dataset["NumberOfMajorSurgeries"].replace('No major surgery',0,inplace=True)
/usr/local/lib/python3.7/site-packages/pandas/core/series.py:4582: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
In [43]:
dataset
Out[43]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name state_group
0 Id2335 1992 Jul 9 0 563.84 2 3 2 17.58 4.51 0 0 0 1 0 German, Mr. Aaron K 3
1 Id2334 1992 Nov 30 0 570.62 2 1 2 17.60 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P 3
2 Id2333 1993 Jun 30 0 600.00 2 1 2 16.47 6.35 0 0 1 1 0 Albano, Ms. Julie 3
3 Id2332 1992 Sep 13 0 604.54 3 3 2 17.70 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. 3
4 Id2331 1998 Jul 27 0 637.26 3 3 2 22.34 5.57 0 0 0 1 0 Brietzke, Mr. Jordan 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2328 Id7 1994 Oct 27 1 51194.56 1 3 0 36.40 6.07 0 0 0 0 1 Macpherson, Mr. Scott 1
2329 Id6 1962 Aug 4 0 52590.83 1 3 0 32.80 6.59 0 0 0 0 1 Baker, Mr. Russell B. 1
2330 Id5 1989 Jun 19 0 55135.40 1 2 1 35.53 5.45 0 0 0 0 1 Kadala, Ms. Kristyn 2
2333 Id2 1977 Jun 8 0 62592.87 2 3 2 30.36 5.77 0 0 0 0 1 Lehner, Mr. Matthew D 3
2334 Id1 1968 Oct 12 0 63770.43 1 3 2 47.41 7.47 0 0 0 0 1 Hawks, Ms. Kelly 3

1755 rows × 18 columns

7. Age appears to be a significant factor in this analysis. Calculate the patients' ages based on their dates of birth.

In [44]:
month_dict = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
In [45]:
dataset['month'] = dataset['month'].map(month_dict)
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [46]:
dataset
Out[46]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name state_group
0 Id2335 1992 7 9 0 563.84 2 3 2 17.58 4.51 0 0 0 1 0 German, Mr. Aaron K 3
1 Id2334 1992 11 30 0 570.62 2 1 2 17.60 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P 3
2 Id2333 1993 6 30 0 600.00 2 1 2 16.47 6.35 0 0 1 1 0 Albano, Ms. Julie 3
3 Id2332 1992 9 13 0 604.54 3 3 2 17.70 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. 3
4 Id2331 1998 7 27 0 637.26 3 3 2 22.34 5.57 0 0 0 1 0 Brietzke, Mr. Jordan 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2328 Id7 1994 10 27 1 51194.56 1 3 0 36.40 6.07 0 0 0 0 1 Macpherson, Mr. Scott 1
2329 Id6 1962 8 4 0 52590.83 1 3 0 32.80 6.59 0 0 0 0 1 Baker, Mr. Russell B. 1
2330 Id5 1989 6 19 0 55135.40 1 2 1 35.53 5.45 0 0 0 0 1 Kadala, Ms. Kristyn 2
2333 Id2 1977 6 8 0 62592.87 2 3 2 30.36 5.77 0 0 0 0 1 Lehner, Mr. Matthew D 3
2334 Id1 1968 10 12 0 63770.43 1 3 2 47.41 7.47 0 0 0 0 1 Hawks, Ms. Kelly 3

1755 rows × 18 columns

In [47]:
dataset.year = dataset.year.astype(int)
/usr/local/lib/python3.7/site-packages/pandas/core/generic.py:5170: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
In [48]:
dataset['age'] = 2023 - dataset.year
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [49]:
dataset
Out[49]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name state_group age
0 Id2335 1992 7 9 0 563.84 2 3 2 17.58 4.51 0 0 0 1 0 German, Mr. Aaron K 3 31
1 Id2334 1992 11 30 0 570.62 2 1 2 17.60 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P 3 31
2 Id2333 1993 6 30 0 600.00 2 1 2 16.47 6.35 0 0 1 1 0 Albano, Ms. Julie 3 30
3 Id2332 1992 9 13 0 604.54 3 3 2 17.70 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. 3 31
4 Id2331 1998 7 27 0 637.26 3 3 2 22.34 5.57 0 0 0 1 0 Brietzke, Mr. Jordan 3 25
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2328 Id7 1994 10 27 1 51194.56 1 3 0 36.40 6.07 0 0 0 0 1 Macpherson, Mr. Scott 1 29
2329 Id6 1962 8 4 0 52590.83 1 3 0 32.80 6.59 0 0 0 0 1 Baker, Mr. Russell B. 1 61
2330 Id5 1989 6 19 0 55135.40 1 2 1 35.53 5.45 0 0 0 0 1 Kadala, Ms. Kristyn 2 34
2333 Id2 1977 6 8 0 62592.87 2 3 2 30.36 5.77 0 0 0 0 1 Lehner, Mr. Matthew D 3 46
2334 Id1 1968 10 12 0 63770.43 1 3 2 47.41 7.47 0 0 0 0 1 Hawks, Ms. Kelly 3 55

1755 rows × 19 columns

In [ ]:
 

8. The gender of the patient may be an important factor in determining the cost of hospitalization. The salutations in a beneficiary's name can be used to determine their gender. Make a new field for the beneficiary's gender.

In [50]:
gender= ['0' if 'Mr.' in name else '1' for name in dataset['name']]
dataset["Gender"]=gender
dataset.head()
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
Out[50]:
Customer ID year month date children charges Hospital tier City tier State ID BMI HBA1C Heart Issues Any Transplants Cancer history NumberOfMajorSurgeries smoker name state_group age Gender
0 Id2335 1992 7 9 0 563.84 2 3 2 17.58 4.51 0 0 0 1 0 German, Mr. Aaron K 3 31 0
1 Id2334 1992 11 30 0 570.62 2 1 2 17.60 4.39 0 0 0 1 0 Rosendahl, Mr. Evan P 3 31 0
2 Id2333 1993 6 30 0 600.00 2 1 2 16.47 6.35 0 0 1 1 0 Albano, Ms. Julie 3 30 1
3 Id2332 1992 9 13 0 604.54 3 3 2 17.70 6.28 0 0 0 1 0 Riveros Gonzalez, Mr. Juan D. Sr. 3 31 0
4 Id2331 1998 7 27 0 637.26 3 3 2 22.34 5.57 0 0 0 1 0 Brietzke, Mr. Jordan 3 25 0

9. You should also visualize the distribution of costs using a histogram, box and whisker plot, and swarm plot.

In [51]:
plt.figure(figsize=(15,8))
sns.histplot(dataset['charges'])
plt.title('Distribution of cost')
plt.show()
In [52]:
plt.figure(figsize=(15,8))
sns.boxplot(dataset['charges'])
plt.title('Distribution of cost')
plt.show()
/usr/local/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
In [53]:
plt.figure(figsize=(15,8))
sns.swarmplot(dataset['charges'])
plt.title('Distribution of cost')
plt.show()
/usr/local/lib/python3.7/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

10. State how the distribution is different across gender and tiers of hospitals

In [54]:
plt.figure(figsize=(15,8))
sns.boxplot(x = 'charges', y = 'Gender',data = dataset)
plt.title('Distribution of cost')
plt.show()
In [55]:
plt.figure(figsize = (15,5))
sns.boxplot(x = "City tier",y = "charges", data = dataset)
plt.show()

11. Create a radar chart to showcase the median hospitalization cost for each tier of hospitals

In [56]:
median = dataset.groupby('Hospital tier')[['charges']].median().reset_index()
In [58]:
import plotly.express as px
fig = px.line_polar(median, r='charges', theta='Hospital tier', line_close=True)
fig.show()

12. Create a frequency table and a stacked bar chart to visualize the count of people in the different tiers of cities and hospitals

In [59]:
table = pd.crosstab(dataset['City tier'], dataset['Hospital tier'])
print(table)
Hospital tier   1    2    3
City tier                  
1              64  317  160
2              89  365  157
3              91  348  164
In [60]:
table.plot(kind='bar', stacked=True)
plt.xlabel('City tier')
plt.ylabel('Hospital tier')
plt.title('Count of People in Different Tiers of Cities and Hospitals')
plt.show()

13. Test the following null hypotheses:

a. The average hospitalization costs for the three types of hospitals are not significantly different

In [61]:
 import scipy.stats as stats
print('Null Hypothesis => Average hospitalization costs for the three types of hospitals are not significantly different.')
f_val, p_val = stats.f_oneway(dataset[dataset['Hospital tier'] == 'tier,1']['charges'],
                              dataset[dataset['Hospital tier'] == 'tier,2']['charges'],
                              dataset[dataset['Hospital tier'] == 'tier,3']['charges'])
print('P-value :',p_val)
if p_val < 0.05:
    print("Reject null hypothesis")
else:
    print("Accept null hypothesis")
Null Hypothesis => Average hospitalization costs for the three types of hospitals are not significantly different.
P-value : nan
Accept null hypothesis
/usr/local/lib/python3.7/site-packages/scipy/stats/stats.py:3333: RuntimeWarning:

Mean of empty slice.

/usr/local/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning:

invalid value encountered in double_scalars

/usr/local/lib/python3.7/site-packages/scipy/stats/stats.py:3336: RuntimeWarning:

invalid value encountered in double_scalars

/usr/local/lib/python3.7/site-packages/scipy/stats/stats.py:3339: RuntimeWarning:

invalid value encountered in double_scalars

/usr/local/lib/python3.7/site-packages/scipy/stats/stats.py:3343: RuntimeWarning:

invalid value encountered in double_scalars

b. The average hospitalization costs for the three types of cities are not significantly different

In [62]:
print('Null Hypothesis => Average hospitalization costs for the three types of cities are not significantly different.')
f_val, p_val = stats.f_oneway(dataset[dataset['City tier'] == 'tier,1']['charges'],
                              dataset[dataset['City tier'] == 'tier,2']['charges'],
                              dataset[dataset['City tier'] == 'tier,3']['charges'])
print('P-value :',p_val)
if p_val < 0.05:
    print("Reject null hypothesis")
else:
    print("Accept null hypothesis")
Null Hypothesis => Average hospitalization costs for the three types of cities are not significantly different.
P-value : nan
Accept null hypothesis

c. The average hospitalization cost for smokers is not significantly different from the average cost for nonsmokers

In [63]:
print('Null Hypothesis => Average hospitalization costs for smokers is not significantly different from the average cost for nonsmokers.')
t_val, p_val = stats.ttest_ind(dataset[dataset['smoker'] == 'yes']['charges'],
                              dataset[dataset['smoker'] == 'no']['charges'])
                          
print('P-value :',p_val)
if p_val < 0.05:
    print("Reject null hypothesis")
else:
    print("Accept null hypothesis")
Null Hypothesis => Average hospitalization costs for smokers is not significantly different from the average cost for nonsmokers.
P-value : nan
Accept null hypothesis

d. Smoking and heart issues are independent

In [64]:
 from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(dataset['smoker'], dataset['Heart Issues'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f'P-value = {p}')
if p < 0.05:
    print("Reject the null hypothesis, Smoking and heart issues are independent.")
else:
    print("Accept null hypothesis, Smoking and heart issues are independent.")
P-value = 0.9107065371179246
Accept null hypothesis, Smoking and heart issues are independent.

2. Machine Learning

1. Examine the correlation between predictors to identify highly correlated predictors. Use a heatmap to visualize this.

In [65]:
dataset.drop(["Customer ID",'name'], inplace=True, axis=1)
/usr/local/lib/python3.7/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [66]:
plt.figure(figsize=(15,10))
sns.heatmap(dataset.corr(),square=True,annot=True,linewidths=1)
Out[66]:
<AxesSubplot:>

2. Develop and evaluate the final model using regression with a stochastic gradient descent optimizer. Also, ensure that you apply all the following suggestions:

• Perform the stratified 5-fold cross-validation technique for model building and validation • Use standardization and hyperparameter tuning effectively • Use sklearn-pipelines • Use appropriate regularization techniques to address the bias-variance trade-off

a. Create five folds in the data, and introduce a variable to identify the folds
b. For each fold, run a for loop and ensure that 80 percent of the data is used to train the model and the remaining 20 percent is used to validate it in each iteration
c. Develop five distinct models and five distinct validation scores (root mean squared error values)
d. Determine the variable importance scores, and identify the redundant variables
In [67]:
from sklearn.model_selection import train_test_split
In [68]:
x = dataset.drop(["charges"], axis=1)
y = dataset['charges']
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=.20,random_state=10)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)
from sklearn.linear_model import SGDRegressor
In [69]:
from sklearn.model_selection import GridSearchCV
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2,0.3,0.4,0.5,
0.6,0.7,0.8,0.9,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,
9.0,10.0,20,50,100,500,1000],
'penalty': ['l2', 'l1', 'elasticnet']}
sgd = SGDRegressor()
# Cross Validation
folds = 5
model_cv = GridSearchCV(estimator = sgd,
param_grid = params,
scoring = 'neg_mean_absolute_error',
cv = folds,
return_train_score = True,
verbose = 1)
model_cv.fit(x_train,y_train)
Fitting 5 folds for each of 84 candidates, totalling 420 fits
Out[69]:
GridSearchCV(cv=5, estimator=SGDRegressor(),
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3,
                                   0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0,
                                   4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50,
                                   100, 500, 1000],
                         'penalty': ['l2', 'l1', 'elasticnet']},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
In [70]:
model_cv.best_params_
Out[70]:
{'alpha': 100, 'penalty': 'l1'}
In [71]:
sgd = SGDRegressor(alpha= 100, penalty= 'l1')
In [72]:
sgd.fit(x_train, y_train)
Out[72]:
SGDRegressor(alpha=100, penalty='l1')
In [73]:
sgd.score(x_test, y_test)
Out[73]:
0.8788544308886965
In [74]:
 y_pred = sgd.predict(x_test)
In [75]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
sgd_mae = mean_absolute_error(y_test, y_pred)
sgd_mse = mean_squared_error(y_test, y_pred)
sgd_rmse = sgd_mse*(1/2.0)
In [76]:
 print("MAE:", sgd_mae)
print("MSE:", sgd_mse)
print("RMSE:", sgd_rmse)
MAE: 2841.7189516037624
MSE: 21617895.638213404
RMSE: 10808947.819106702
In [77]:
importance = sgd.coef_
pd.DataFrame(importance, index = x.columns, columns=['Feature_imp'])
Out[77]:
Feature_imp
year -1737.871184
month 0.000000
date 0.000000
children 230.852072
Hospital tier -1074.721106
City tier -113.384767
State ID 0.000000
BMI 2786.057774
HBA1C 0.000000
Heart Issues 0.000000
Any Transplants 0.000000
Cancer history 29.426534
NumberOfMajorSurgeries 0.000000
smoker 9490.780084
state_group 0.000000
age 1737.871184
Gender 0.000000

3. Use random forest and extreme gradient boosting for cost prediction, share your crossvalidation results, and calculate the variable importance scores

1.0.3 random forest

In [78]:
from sklearn.ensemble import RandomForestRegressor
In [79]:
 rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(x_train, y_train)
Out[79]:
RandomForestRegressor(n_estimators=1000, random_state=42)
In [80]:
 score = rf.score(x_test,y_test)
score
Out[80]:
0.9035185561992792
In [81]:
y_pred = rf.predict(x_test)
rf_mae = mean_absolute_error(y_test, y_pred)
rf_mae
Out[81]:
2088.5978950712283

1.0.4 extreme gradient boosting

In [82]:
from sklearn.ensemble import GradientBoostingRegressor
In [83]:
gbr = GradientBoostingRegressor(n_estimators = 1000, random_state = 42)
gbr.fit(x_train, y_train)
Out[83]:
GradientBoostingRegressor(n_estimators=1000, random_state=42)
In [84]:
score = gbr.score(x_test,y_test)
score
Out[84]:
0.8873166159703795
In [85]:
y_pred = gbr.predict(x_test)
gbr_mae = mean_absolute_error(y_test, y_pred)
gbr_mae
Out[85]:
2539.099317058944
  1. Case scenario: Estimate the cost of hospitalization for Christopher, Ms. Jayna (her date of birth is 12/28/1988, height is 170 cm, and weight is 85 kgs). She lives in a tier-1 city and her state’s State ID is R1011. She lives with her partner and two children. She was found to be nondiabetic (HbA1c = 5.8). She smokes but is otherwise healthy. She has had no transplants or major surgeries. Her father died of lung cancer. Hospitalization costs will be estimated using tier-1 hospitals.
In [107]:
dataset.columns
Out[107]:
Index(['year', 'month', 'date', 'children', 'charges', 'Hospital tier',
       'City tier', 'State ID', 'BMI', 'HBA1C', 'Heart Issues',
       'Any Transplants', 'Cancer history', 'NumberOfMajorSurgeries', 'smoker',
       'state_group', 'age', 'Gender'],
      dtype='object')
In [108]:
df= pd.DataFrame({ 'year' : [1998], 'month' : [12] ,'date': [28],
                      'city_tier' : [1], 'children' :[ 2],
                       'HbA1c' : [5.8], 
                       'smoker_yes' : [1],
                       'heart_issues_yes' : [0],
                       'any_transplants_yes' : [0],
                       'numberofmajorsurgeries' :[ 0],
                       'cancer_history_yes' : [1],
                       'hospital_tier' : [1],
                       'bmi' : [85/(1.70 **2)], 'age' : [25],'Gender' : [0],'state_group' : [1],
                       'state_id_R1011' : [1]
                      })
In [102]:
df
Out[102]:
year month date city_tier children HbA1c smoker_yes heart_issues_yes any_transplants_yes numberofmajorsurgeries cancer_history_yes hospital_tier bmi age Gender state_id_R1011
0 1998 12 28 1 2 5.8 1 0 0 0 1 1 29.411765 25 0 1

5. Find the predicted hospitalization cost using all five models. The predicted value should be the mean of the five models' predicted values'.

In [92]:
Hospital_cost = []
In [109]:
 Cost1 = sgd.predict(df)
Hospital_cost.append(Cost1)
Cost2 = rf.predict(df)
Hospital_cost.append(Cost2)
Cost3 = gbr.predict(df)
Hospital_cost.append(Cost3)
avg_cost = np.mean(Hospital_cost)
avg_cost
/usr/local/lib/python3.7/site-packages/sklearn/base.py:444: UserWarning:

X has feature names, but SGDRegressor was fitted without feature names

/usr/local/lib/python3.7/site-packages/sklearn/base.py:444: UserWarning:

X has feature names, but RandomForestRegressor was fitted without feature names

/usr/local/lib/python3.7/site-packages/sklearn/base.py:444: UserWarning:

X has feature names, but GradientBoostingRegressor was fitted without feature names

Out[109]:
-1048426.9438759197
In [ ]: